[pull] main from triggerdotdev:main#208

Merged
pull[bot] merged 1 commit into Dustin4444:main from triggerdotdev:main on Jun 10, 2026

Conversation

pull[bot] commented Jun 10, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

…when the read replica lags (#3889)

## Summary

When `RUN_ENGINE_READ_REPLICA_SNAPSHOTS_SINCE_ENABLED` is on,
`RunEngine.getSnapshotsSince` reads from the read replica. During write
spikes the replica can briefly lag, so the snapshot id a runner just
learned from the writer isn't visible there yet. Previously the lookup
threw, the worker route returned a 500, and the runner waited for its
next poll, turning sub-second snapshot notifications into poll-interval
latency exactly when things are busiest.

This PR makes the flag safe to enable. A replica miss of the since
snapshot gets one jittered retry on the replica (most lag windows are
shorter than the ~50–200ms wait, so in those cases the writer is never
touched), then falls back to the primary. Recoveries are observed via a
new `run_engine.snapshots_since.replica_miss` counter with an `outcome`
attribute (`replica_retry` vs `primary`). Only genuine misses, absent
on the primary too, remain errors.
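
The recovery order described above can be sketched roughly as follows. This is an illustration, not the code from the PR: `readWithReplicaRecovery`, the `RecoveryDeps` shape, and the error message text are hypothetical, and the reads, the sleep, and the counter are injected so the flow can be followed without a database. Only the error name, the two `outcome` values, and the retry-then-primary order come from the description.

```typescript
// Typed "not found" error. The class name matches the PR description; the
// message text is a placeholder, since the real message is not shown there.
class ExecutionSnapshotNotFoundError extends Error {
  constructor(snapshotId: string) {
    super(`Execution snapshot not found: ${snapshotId}`);
    this.name = "ExecutionSnapshotNotFoundError";
    // Keeps instanceof working even when compiled to an ES5 target.
    Object.setPrototypeOf(this, ExecutionSnapshotNotFoundError.prototype);
  }
}

type ReplicaMissOutcome = "replica_retry" | "primary";

// Hypothetical dependency bundle used only for this sketch.
interface RecoveryDeps<T> {
  readReplica: () => Promise<T>;
  readPrimary: () => Promise<T>;
  sleep: (ms: number) => Promise<void>;
  recordMiss: (outcome: ReplicaMissOutcome) => void;
  retryDelayMs: number; // 0 skips the replica retry
}

async function readWithReplicaRecovery<T>(deps: RecoveryDeps<T>): Promise<T> {
  try {
    return await deps.readReplica();
  } catch (error) {
    // Anything other than the expected lag miss is a real failure.
    if (!(error instanceof ExecutionSnapshotNotFoundError)) throw error;
  }

  if (deps.retryDelayMs > 0) {
    await deps.sleep(deps.retryDelayMs);
    try {
      const result = await deps.readReplica();
      deps.recordMiss("replica_retry");
      return result;
    } catch (error) {
      if (!(error instanceof ExecutionSnapshotNotFoundError)) throw error;
    }
  }

  // If the primary also throws not-found, the error propagates and nothing
  // is counted, so bogus ids do not show up as replica lag.
  const result = await deps.readPrimary();
  deps.recordMiss("primary");
  return result;
}
```

In this sketch the counter is incremented only after a read succeeds, which is one way to get the "missing on both sides, counter = 0" behaviour listed in the test plan.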

## Design

- `getExecutionSnapshotsSince` now throws a typed
`ExecutionSnapshotNotFoundError` so the engine can distinguish the
expected lag miss from real failures. The message string is unchanged
and the error never leaves the engine.
- The recovery path only engages when the flag is on, a distinct replica
client is configured, and no transaction client was passed. With the
flag off, the path is behaviorally identical to before.
- Retry delay bounds are configurable
(`RUN_ENGINE_SNAPSHOTS_SINCE_REPLICA_RETRY_MIN_MS`/`MAX_MS`, default
50/200; `MAX_MS=0` skips the replica retry and goes straight to the
primary).
- The warn log fires only when the primary serves the read (the writer
spill is the operationally interesting event); replica-retry recoveries
are counted but quiet. A permanently-missing snapshot id stays an
error-level failure with a `failedDuring` field, so lag metrics aren't
polluted by bogus ids.
- Stale-tail lag (replica has the since snapshot but not newer rows)
deliberately still returns the replica's view; the next poll catches up.
- The since-snapshot anchor lookup is now scoped to the polled run
(`where: { id, runId }`), so a snapshot id from a different run raises
not-found instead of silently anchoring a too-wide window of the run's
snapshots.
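
The gating conditions and the retry delay bounds from the list above can be illustrated with two small helpers. Both are hypothetical: the names `shouldUseReplicaRecovery` and `jitteredRetryDelayMs`, the uniform distribution of the jitter, and the clamping of a minimum that exceeds the maximum are assumptions, not details confirmed by the PR. Only the three gating conditions, the 50/200 defaults, and the `MAX_MS=0` behaviour come from the description.

```typescript
// Hypothetical view of the three conditions the recovery path depends on.
interface RecoveryGate {
  flagEnabled: boolean; // RUN_ENGINE_READ_REPLICA_SNAPSHOTS_SINCE_ENABLED
  hasDistinctReplica: boolean; // a replica client separate from the primary
  hasTransactionClient: boolean; // caller passed its own transaction client
}

// The recovery path engages only when all three conditions line up.
function shouldUseReplicaRecovery(gate: RecoveryGate): boolean {
  return gate.flagEnabled && gate.hasDistinctReplica && !gate.hasTransactionClient;
}

// Picks a delay in [minMs, maxMs], or returns null when maxMs is 0, meaning
// "skip the replica retry and go straight to the primary".
// Assumption: uniform jitter, and a min above max is clamped down to max.
function jitteredRetryDelayMs(
  minMs: number,
  maxMs: number,
  random: () => number = Math.random
): number | null {
  if (maxMs <= 0) return null;
  const low = Math.min(Math.max(minMs, 0), maxMs);
  return Math.floor(low + random() * (maxMs - low + 1));
}

// With the documented defaults (50/200) the delay always lands in 50..200.
const exampleDelay = jitteredRetryDelayMs(50, 200);
```

Passing the random source as a parameter keeps the helper deterministic under test, which is the main reason for that choice here.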

## Test plan

All vitest + testcontainers, no mocks. A new `schemaOnlyPrisma` fixture
(migrated-but-empty clone database) simulates a replica that hasn't
caught up, and a real in-memory OTel meter pins the counter semantics
per outcome.

- [x] Replica catches up during the jittered retry window → served by
the replica, `outcome=replica_retry` = 1, primary never consulted
- [x] Replica permanently missing the since snapshot → served by the
primary, `outcome=primary` = 1
- [x] Snapshot missing on both replica and primary → null, counter = 0
- [x] Replica has the since snapshot but lags by one → the replica's
view is served, no fallback (verified discriminating power: the test
fails if reads secretly hit the primary)
- [x] Flag off with a replica configured → primary serves the read
- [x] Transaction client provided → bypasses the replica entirely
- [x] Since snapshot belonging to a different run → null
- [x] Existing getSnapshotsSince + waitpoints suites green; run-engine,
testcontainers, and webapp typechecks pass
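
Two of the cases above, the stale tail and the since snapshot from a different run, can be modelled in memory to show why they behave as listed. This is a toy model under stated assumptions, not the testcontainers setup the PR uses: `SnapshotRow`, `snapshotsSince`, the numeric `createdAt` ordering, and all ids are hypothetical, and a missing anchor is returned as `null` directly rather than thrown and translated by the engine.

```typescript
// Hypothetical, simplified snapshot row.
interface SnapshotRow {
  id: string;
  runId: string;
  createdAt: number;
}

// Anchor lookup scoped to the polled run, mirroring `where: { id, runId }`.
// Returns null when the anchor is not visible in this view of the data.
function snapshotsSince(
  rows: SnapshotRow[],
  runId: string,
  sinceId: string
): SnapshotRow[] | null {
  const anchor = rows.find((row) => row.id === sinceId && row.runId === runId);
  if (!anchor) return null;
  return rows
    .filter((row) => row.runId === runId && row.createdAt > anchor.createdAt)
    .sort((a, b) => a.createdAt - b.createdAt);
}

const primaryRows: SnapshotRow[] = [
  { id: "snap_1", runId: "run_a", createdAt: 1 },
  { id: "snap_2", runId: "run_a", createdAt: 2 },
  { id: "snap_3", runId: "run_a", createdAt: 3 },
  { id: "snap_x", runId: "run_b", createdAt: 1 },
];

// A replica that has the anchor but has not yet received the newest row.
const laggingReplicaRows = primaryRows.filter((row) => row.id !== "snap_3");

// Stale tail: the anchor is found, so the replica's shorter view is served
// and no fallback is triggered. The next poll picks up snap_3.
const staleTail = snapshotsSince(laggingReplicaRows, "run_a", "snap_1");

// Cross-run anchor: snap_x exists but belongs to run_b, so it is not found.
const crossRun = snapshotsSince(primaryRows, "run_a", "snap_x");
```

Because the fallback is keyed on the anchor lookup alone, a replica that is only missing newer rows never trips it, which matches the "no fallback" expectation in the stale-tail case.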

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

pull[bot] locked and limited conversation to collaborators Jun 10, 2026
pull[bot] added the ⤵️ pull label Jun 10, 2026
pull[bot] merged commit 6afc9bf into Dustin4444:main Jun 10, 2026